
[DistInf] Enable SGLang disagg with MoRI-io & MoRI-EP for all existing models#154

Open
lcskrishna wants to merge 1 commit into ROCm:develop from lcskrishna:csrikris-pr222

Conversation

Contributor

@lcskrishna lcskrishna commented Apr 24, 2026

Motivation

This PR refactors the MoRI-IO & MoRI-EP solution for the existing SGLang disaggregated models.

Summary:

  • Introduced DP_MODE=1 / 0 to switch between the TP solution and the DP solution (default: 0).
  • Introduced RUN_MORI=1 to switch to the MoRI solutions; this will become the default in the future (default: 0).
  • models.yaml: contains the model information and per-model arguments required to enable MoRI-IO & MoRI-EP.
  • sglang_disagg_mori_io_ep.sh: script that runs the MoRI-IO & MoRI-EP solutions for the models listed in the YAML file.
  • Some supporting modifications in run_xPyD_models.slurm.

Test Plan

export DOCKER_IMAGE_NAME=DOCKER_IMAGE;
export xP=1;
export yD=2;
export MODEL_NAME=DeepSeek-R1;
export MODEL_DIR=/shared_inference/models/deepseek-ai/;
export RUN_MORI=1;

echo "Submitting with xP=${xP} yD=${yD}"
NODE_LIST=<NODELIST>
sbatch -N 4 -n 4 -p amd-rccl --nodelist=${NODE_LIST} run_xPyD_models.slurm;

Submission Checklist


Copilot AI left a comment


Pull request overview

Refactors the SGLang disaggregated (prefill/decode) launch flow to support MoRI-IO + MoRI-EP across existing models, adding a DP-vs-TP switch and a YAML-based model flag catalog.

Changes:

  • Added sglang_disagg_mori_io_ep.sh (MoRI transfer backend launcher) with DP_MODE support and YAML-driven per-model flags.
  • Updated run_xPyD_models.slurm to optionally route runs through the MoRI launcher (RUN_MORI=1), add DP-mode validation, and pass additional env knobs.
  • Added MoRI environment defaults (mori_ep_env.sh) and a model catalog (models.yaml); updated the disagg inference Dockerfile for a newer base and NIC backend selection.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 10 comments.

Summary per file:
  scripts/sglang_disagg/sglang_disagg_mori_io_ep.sh: New MoRI-based disaggregated launcher with DP/TP mode selection, log-based readiness gating, and YAML model configuration.
  scripts/sglang_disagg/run_xPyD_models.slurm: Adds MoRI entrypoint selection, DP-mode allowlist validation, and passes new MoRI-related env vars into containers.
  scripts/sglang_disagg/mori_ep_env.sh: Introduces MoRI/SGLang multi-node environment defaults and DP-mode tuning knobs.
  scripts/sglang_disagg/models.yaml: Centralizes per-model CLI flags for TP vs DP launch modes.
  docker/sglang_disagg_inference.ubuntu.amd.Dockerfile: Updates base image and adds NIC backend build-time selection logic for MoRI.


Comment on lines +79 to 99
_run_mori="${RUN_MORI:-0}"
if [[ "$_run_mori" == "1" ]]; then
  if model_allows_mori_ep "$MODEL_NAME"; then
    RUN_FILE="sglang_disagg_mori_io_ep.sh"
    echo "RUN_MORI=1: using $RUN_FILE for model '$MODEL_NAME'"
  else
    echo "Error: RUN_MORI=1 but MODEL_NAME '$MODEL_NAME' is not in MORI_EP_VALID_MODELS" >&2
    printf "MoRI EP allowed models:\n" >&2
    for m in "${MORI_EP_VALID_MODELS[@]}"; do
      printf " - %s\n" "$m" >&2
    done
    exit 1
  fi
  if [[ ! -f "$RUN_FILE" ]]; then
    echo "Error: RUN_MORI=1 requires '$RUN_FILE' in $(pwd)." >&2
    exit 1
  fi
else
  echo "Error: Model '$MODEL_NAME' not found in MODEL_RUNFILES"
  echo "Available models: ${!MODEL_RUNFILES[@]}"
  exit 1
  RUN_FILE="sglang_disagg_server.sh"
  echo "RUN_MORI not set: using $RUN_FILE"
fi

Copilot AI Apr 24, 2026


PR description says RUN_MORI “will be added as default”, but here _run_mori defaults to 0 and the script explicitly logs “RUN_MORI not set”. If the intended behavior is to enable MoRI by default, consider flipping the default to RUN_MORI=1 (or clarify the PR description / add a TODO noting it’s intentionally not default yet).

Comment on lines +92 to +95
if [[ ! -f "$RUN_FILE" ]]; then
  echo "Error: RUN_MORI=1 requires '$RUN_FILE' in $(pwd)." >&2
  exit 1
fi

Copilot AI Apr 24, 2026


RUN_FILE is checked as a relative path (-f "$RUN_FILE") and later used to build RUN_FILE_FULL based on MOONCAKE_REPO_DIR=$(pwd). In Slurm jobs the working directory depends on where sbatch was invoked, so submitting from outside this folder can break unexpectedly. Consider resolving paths relative to the script location (e.g., SCRIPT_DIR=$(cd "$(dirname "$0")" && pwd)) and basing MOONCAKE_REPO_DIR / RUN_FILE_FULL on that.

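The path-resolution fix suggested above can be sketched as follows. This is a minimal, illustrative snippet; `resolve_run_file` is a hypothetical helper, not a function from the PR:

```shell
# Sketch: anchor relative paths at the script's own directory instead of the
# caller's working directory, so `sbatch` can be invoked from anywhere.
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)"

# Hypothetical helper: turn a script-relative file name into an absolute path.
resolve_run_file() {
  printf '%s/%s\n' "$SCRIPT_DIR" "$1"
}
```

MOONCAKE_REPO_DIR and RUN_FILE_FULL could then be derived from `$SCRIPT_DIR` rather than `$(pwd)`.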
Comment on lines 57 to 66
# Define valid model names
VALID_MODELS=( \
  "Qwen3-32B" \
  "Mixtral-8x7B-v0.1" \
  "Mixtral-8x7B-Instruct-v0.1" \
  "Llama-3.1-8B-Instruct" \
  "Llama-3.1-405B-Instruct-FP8-KV" \
  "amd-Llama-3.3-70B-Instruct-FP8-KV" \
  "DeepSeek-V3" \
  "DeepSeek-R1" \
)

Copilot AI Apr 24, 2026


Model naming is now Mixtral-8x7B-Instruct-v0.1 in VALID_MODELS, but other launch helpers in this directory still reference Mixtral-8x7B-v0.1 (e.g., scripts/sglang_disagg/salloc_launch.sh:8 and scripts/sglang_disagg/sglang_disagg_server.sh model config maps). This inconsistency will cause validation failures or unexpected defaults depending on which entrypoint is used; consider updating the other scripts or supporting both aliases here.

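The "support both aliases" option could be sketched as a normalization step run before validation. The alias table below is illustrative, based only on the two Mixtral spellings mentioned in this comment:

```shell
# Sketch: accept both Mixtral spellings by mapping the old name to the new one
# before checking membership in VALID_MODELS. Extend the table as other
# scripts in this directory are reconciled.
normalize_model_name() {
  case "$1" in
    Mixtral-8x7B-v0.1) echo "Mixtral-8x7B-Instruct-v0.1" ;;
    *)                 echo "$1" ;;
  esac
}
```

Usage: `MODEL_NAME="$(normalize_model_name "$MODEL_NAME")"` before the allowlist check.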
Comment on lines +349 to +353
set -x
eval "$ROUTER_CMD" 2>&1 | tee "/run_logs/${SLURM_JOB_ID:-0}/proxy_NODE${NODE_RANK}.log" >/dev/null &
set +x
proxy_pid=$!
echo "Router (sglang_router) started pid=${proxy_pid} (DP_MODE=${DP_MODE})"

Copilot AI Apr 24, 2026


proxy_pid=$! here captures the PID of tee (the last process in the backgrounded pipeline), not the python3 -m sglang_router.launch_router process. As a result, kill "${proxy_pid}" may not actually stop the router, leaving it running and keeping port 2322 bound. Prefer launching without a pipe (redirect directly to a log file), or use a process-group/job-control approach so you can reliably terminate the router process.

Suggested change
set -x
eval "$ROUTER_CMD" 2>&1 | tee "/run_logs/${SLURM_JOB_ID:-0}/proxy_NODE${NODE_RANK}.log" >/dev/null &
set +x
proxy_pid=$!
echo "Router (sglang_router) started pid=${proxy_pid} (DP_MODE=${DP_MODE})"
proxy_log="/run_logs/${SLURM_JOB_ID:-0}/proxy_NODE${NODE_RANK}.log"
set -x
bash -lc 'exec ${ROUTER_CMD}' >"${proxy_log}" 2>&1 &
set +x
proxy_pid=$!
echo "Router (sglang_router) started pid=${proxy_pid} (DP_MODE=${DP_MODE}, log=${proxy_log})"

echo ""
} | tee "$PREFILL_LOG"
set -x
eval "$PREFILL_CMD" 2>&1 | tee -a "$PREFILL_LOG" >/dev/null &

Copilot AI Apr 24, 2026


Same pipeline PID issue as the router: prefill_pid=$! is the PID of tee, not the actual sglang.launch_server process, so the subsequent kill "${prefill_pid}" may not stop the server. Consider redirecting logs without tee, or track/kill the real server PID (e.g., avoid pipelines or kill the process group).

Suggested change
eval "$PREFILL_CMD" 2>&1 | tee -a "$PREFILL_LOG" >/dev/null &
eval "$PREFILL_CMD" >>"$PREFILL_LOG" 2>&1 &

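The process-group approach mentioned in this comment can be sketched as below. The variable names mirror the script's, but the default commands are placeholders for demonstration only:

```shell
# Sketch: start the server in its own session/process group via setsid and
# redirect logs directly (no tee pipeline), so $! is the real server process
# and the whole tree can be killed by its process-group id.
PREFILL_LOG="${PREFILL_LOG:-/tmp/prefill_demo.log}"
PREFILL_CMD="${PREFILL_CMD:-sleep 30}"   # placeholder command for the demo

setsid bash -c "$PREFILL_CMD" >>"$PREFILL_LOG" 2>&1 &
prefill_pid=$!   # PID of the session leader, not of tee

sleep 1          # give the server a moment to start (demo only)

# Terminate the whole process group (note the leading '-'):
kill -TERM -- "-$prefill_pid" 2>/dev/null || true
```

The `-"$pid"` form signals every process in the group, so children spawned by the server shell are cleaned up too.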
Comment on lines +83 to +87
git clone --branch v62.0 --depth 1 https://github.com/linux-rdma/rdma-core.git /tmp/rdma-core && \
cd /tmp/rdma-core && \
mkdir -p build && cd build && \
cmake -GNinja -DCMAKE_INSTALL_PREFIX=/usr -DNO_MAN_PAGES=1 .. && \
ninja && ninja install && ldconfig && rm -rf /tmp/rdma-core

Copilot AI Apr 24, 2026


The rdma-core source clone/build/install runs unconditionally on every launch. This adds significant startup time and introduces a hard dependency on outbound network access and build tooling inside the container; if GitHub is unavailable the entire job will fail. Consider moving this into the Docker image (preferred), or gating it behind an opt-in env var and/or a “already installed” check before rebuilding.

Suggested change
git clone --branch v62.0 --depth 1 https://github.com/linux-rdma/rdma-core.git /tmp/rdma-core && \
cd /tmp/rdma-core && \
mkdir -p build && cd build && \
cmake -GNinja -DCMAKE_INSTALL_PREFIX=/usr -DNO_MAN_PAGES=1 .. && \
ninja && ninja install && ldconfig && rm -rf /tmp/rdma-core
## Prefer baking this into the image; runtime install is opt-in to avoid
## unconditional startup cost and hard dependency on outbound network access.
INSTALL_RDMA_CORE_AT_RUNTIME="${INSTALL_RDMA_CORE_AT_RUNTIME:-0}"
if command -v ibv_devinfo >/dev/null 2>&1; then
  echo "RDMA userspace already present; skipping runtime rdma-core install."
elif [[ "${INSTALL_RDMA_CORE_AT_RUNTIME}" == "1" ]]; then
  git clone --branch v62.0 --depth 1 https://github.com/linux-rdma/rdma-core.git /tmp/rdma-core && \
    cd /tmp/rdma-core && \
    mkdir -p build && cd build && \
    cmake -GNinja -DCMAKE_INSTALL_PREFIX=/usr -DNO_MAN_PAGES=1 .. && \
    ninja && ninja install && ldconfig && rm -rf /tmp/rdma-core
  if ! command -v ibv_devinfo >/dev/null 2>&1; then
    echo "ERROR: INSTALL_RDMA_CORE_AT_RUNTIME=1 was set, but rdma-core install did not complete successfully." >&2
    exit 1
  fi
else
  echo "Skipping runtime rdma-core install (set INSTALL_RDMA_CORE_AT_RUNTIME=1 to enable this temporary workaround)."
fi

Comment on lines +39 to +56
NODE_RANK="${NODE_RANK:-0}"
MODEL_PATH=$MODEL_PATH
MODEL_NAME="${MODEL_NAME:-}"
xP="${xP:-1}"
yD="${yD:-1}"
DP_MODE="${DP_MODE:-0}"
# PARALLEL_MODE is derived strictly from DP_MODE for models.yaml (tp vs dp flags).
if [[ "$DP_MODE" == "1" ]]; then
  PARALLEL_MODE=dp
else
  PARALLEL_MODE=tp
fi
echo "PARALLEL_MODE=${PARALLEL_MODE} (DP_MODE=${DP_MODE})"

if [[ -z "${MODEL_NAME:-}" ]]; then
  echo "ERROR: MODEL_NAME not set, exiting"
  exit 1
fi

Copilot AI Apr 24, 2026


MODEL_PATH=$MODEL_PATH will silently set MODEL_PATH to an empty string if it wasn't provided, and the script never validates that it is non-empty before using --model-path ${MODEL_PATH}. Add an explicit check (similar to MODEL_NAME) and use a safe default/assignment (e.g., MODEL_PATH="${MODEL_PATH:-}") so misconfiguration fails fast with a clear error.

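The fail-fast check suggested here could be generalized with a small helper so every required variable gets the same treatment. `require_env` is a hypothetical helper name, not part of the PR:

```shell
# Sketch: verify a required environment variable is non-empty, printing a
# clear error in the same style as the existing MODEL_NAME check.
require_env() {
  local name="$1"
  if [[ -z "${!name:-}" ]]; then
    echo "ERROR: $name not set, exiting" >&2
    return 1
  fi
}

# Usage in the launcher:
#   require_env MODEL_NAME || exit 1
#   require_env MODEL_PATH || exit 1
```

Alternatively, bash's built-in `: "${MODEL_PATH:?ERROR: MODEL_PATH not set}"` achieves the same effect in one line.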
RUN pip install --upgrade sglang-router

# the default already installs mori with AINIC. Reinstall from source for cx7 and bnxt; ainic only clears USE_* overrides.
ARG NIC_BACKEND="cx7"

Copilot AI Apr 24, 2026


In a Dockerfile, ARG default values are taken literally (quotes are not stripped). With ARG NIC_BACKEND="cx7", the value becomes "cx7", so the later case "${NIC_BACKEND}" in cx7) will not match and will fall into the error branch. Use ARG NIC_BACKEND=cx7 (no quotes), and document quoting in the docker build --build-arg example if needed.

Suggested change
ARG NIC_BACKEND="cx7"
ARG NIC_BACKEND=cx7

Comment on lines +527 to +531
set -x
eval "$DECODE_CMD" 2>&1 | tee -a "$DECODE_LOG" >/dev/null &
set +x
decode_pid=$!


Copilot AI Apr 24, 2026


Same pipeline PID issue as the router/prefill: decode_pid=$! captures tee's PID, not the sglang.launch_server PID, so kill "${decode_pid}" may fail to stop the decode server cleanly. Recommend launching the server without a pipeline or killing the process group so the real server process is terminated.

Comment on lines +138 to +179
export MODELS_YAML MODEL_NAME PARALLEL_MODE
eval "$(python3 - <<'PY'
import os
import shlex
import sys
import yaml

config_path = os.environ["MODELS_YAML"]
model_name = os.environ["MODEL_NAME"]
mode = os.environ["PARALLEL_MODE"]

with open(config_path, "r", encoding="utf-8") as f:
    models = yaml.safe_load(f) or {}

if model_name not in models:
    print(f'echo "ERROR: Model {model_name} not found in {config_path}"; exit 1')
    sys.exit(0)

cfg = models[model_name] or {}
prefill = cfg.get("prefill", {}) or {}
decode = cfg.get("decode", {}) or {}


def q(v):
    return shlex.quote(str(v if v is not None else ""))


exports = {
    "MODEL_BASE_FLAGS": cfg.get("base_flags", ""),
    "MODEL_MODE_FLAGS": cfg.get(f"{mode}_flags", ""),
    "MODEL_PREFILL_FLAGS": prefill.get(mode, ""),
    "MODEL_DECODE_FLAGS": decode.get(mode, ""),
    "MODEL_EXPERIMENTAL_FLAGS": cfg.get("experimental_flags", ""),
}

for key, value in exports.items():
    print(f"{key}={q(value)}")
PY
)"

PREFILL_MODEL_CONFIG="${MODEL_BASE_FLAGS} ${MODEL_MODE_FLAGS} ${MODEL_PREFILL_FLAGS} ${MODEL_EXPERIMENTAL_FLAGS}"
DECODE_MODEL_CONFIG="${MODEL_BASE_FLAGS} ${MODEL_MODE_FLAGS} ${MODEL_DECODE_FLAGS} ${MODEL_EXPERIMENTAL_FLAGS}"

Copilot AI Apr 24, 2026


This uses eval to ingest YAML-provided flag strings and later eval again to execute the assembled command. Because the YAML values can include shell expansions (and the current YAML intentionally contains $(seq ...)), this effectively allows arbitrary shell execution via models.yaml. If you only need list expansion for flags like --cuda-graph-bs, prefer representing them as explicit space-separated values in YAML (no $(...)) and build the python3 argv as a Bash array to avoid eval entirely.

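The eval-free approach suggested here can be sketched as follows. It assumes the YAML flag strings contain only whitespace-separated tokens (no quoted spaces or shell expansions), which is what the comment recommends enforcing; `prefill_args` is an illustrative variable name:

```shell
# Sketch: split a flat flag string into a bash array and pass it as discrete
# argv entries, so the flag text is never re-interpreted by the shell.
# Placeholder value for the demo; in the script this comes from models.yaml.
MODEL_BASE_FLAGS="${MODEL_BASE_FLAGS:---trust-remote-code --mem-fraction-static 0.8}"

# Split on whitespace into an array (assumes no quoted spaces in flags).
read -r -a prefill_args <<< "$MODEL_BASE_FLAGS"

# Launch with the array instead of eval'ing a command string:
#   python3 -m sglang.launch_server --model-path "$MODEL_PATH" "${prefill_args[@]}"
```

Because `"${prefill_args[@]}"` expands each token as its own argument, a YAML value containing `$(...)` would be passed through literally rather than executed.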